Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate a word locally given the previous words and the visual content, while the relationship between the semantics of the entire sentence and the visual content is not holistically exploited. As a result, the generated sentences may be contextually correct, but their semantics (e.g., subjects, verbs or objects) may be untrue.

This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and the visual content, while the latter creates a visual-semantic embedding space that enforces the relationship between the semantics of the entire sentence and the visual content. Our proposed LSTM-E consists of three components: 2-D and/or 3-D deep convolutional neural networks (CNNs) for learning a powerful video representation, a deep RNN for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics. Experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best reported performance to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. We also demonstrate that LSTM-E is superior to several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
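The joint objective described above can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the class name LSTME, the mean-pooled sentence representation, all layer sizes, and the trade-off weight lam are assumptions for illustration, and the exact form of the relevance and coherence terms in the paper may differ.

```python
import torch
import torch.nn as nn

class LSTME(nn.Module):
    """Illustrative sketch of the LSTM-E joint objective (assumptions noted above).

    Takes pre-extracted 2-D/3-D CNN video features; combines a relevance loss
    in a joint visual-semantic embedding space with a coherence loss from an
    LSTM language model.
    """
    def __init__(self, video_dim, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # maps video features into the joint space
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.sent_proj = nn.Linear(embed_dim, embed_dim)   # maps a sentence vector into the joint space
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, video_feat, captions):
        # video_feat: (batch, video_dim) pre-extracted CNN features
        # captions:   (batch, T) word indices of the ground-truth sentence
        v = self.video_proj(video_feat)                    # joint-space video vector
        words = self.word_embed(captions)                  # (batch, T, embed_dim)
        s = self.sent_proj(words.mean(dim=1))              # sentence vector: mean pooling (assumed)
        # Relevance loss: pull the whole sentence and the video together
        # in the shared embedding space.
        relevance = ((v - s) ** 2).sum(dim=1).mean()
        # Coherence loss: next-word prediction by the LSTM, with the
        # projected video vector fed as the first input step.
        inputs = torch.cat([v.unsqueeze(1), words[:, :-1]], dim=1)
        logits, _ = self.lstm(inputs)
        logits = self.out(logits)                          # (batch, T, vocab_size)
        coherence = self.ce(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
        lam = 0.7                                          # trade-off weight (assumed value)
        return (1 - lam) * relevance + lam * coherence

# Usage sketch with random data (dimensions are hypothetical):
model = LSTME(video_dim=4096, vocab_size=10000)
loss = model(torch.randn(8, 4096), torch.randint(0, 10000, (8, 12)))
loss.backward()
```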